Abstract: The effective use of humanoid robots in space will depend upon 
the efficacy of interaction between humans and robots. The key to 
achieving this interaction is to provide the robot with sufficient skills for 
natural communication with humans so that humans can interact with the 
robot almost as though it were another human. This requires that a number 
of basic capabilities be incorporated into the robot, including voice 
recognition, natural language, and cognitive tools on-board the robot to 
facilitate interaction between humans and robots through use of common 
representations and shared humanlike behaviors. 
1. INTRODUCTION 

Humanoid robots are now being built to assist 
humans in a variety of activities. The Robonaut 
platform (Ambrose, et al., 2000) is a humanoid robot 
designed to assist human astronauts working in space 
(Figure 1). Demonstrated teleoperative capabilities 
of Robonaut include dexterous grasping, tool 
handling, and cooperation with human teammates on 
manual tasks that require both information exchange 
and physical interaction between robot and human. 
Current development efforts are focused upon the 
implementation of autonomous behaviors and 
capabilities for Robonaut. 

Effective collaboration between robots and humans 
requires the use of an efficient interface whereby a 
human can communicate and interact with a robot 
almost as easily as with another human. In this 
interaction the human may act as a supervisor and/or 
collaborator with the robot. Human-robot 
collaboration is facilitated by a number of 
capabilities built into the interface and the robot 
itself, including voice recognition, natural language 
and gesture understanding, and behaviors supporting 
dynamic autonomy (Sofge, et al., 2003). The 
inclusion of cognitively plausible representations and 
processes incorporated into the robot provides a 
further basis for facilitating collaboration between 
humans and robots, thereby reducing the human 
effort required to adapt to limitations of the robot as 
a non-human collaborator. Use of a cognitive model 
aboard the robot facilitates better communication and 
interaction between the human and the robot through 
use of a common representational framework for the 
environment and objects within it, processing of 
sensor information, and joint problem solving 
involving both humans and robots. 
In this paper we focus on the use of cognitive models 
of spatial reasoning capabilities aboard Robonaut to 
enhance communications between human astronauts 
and Robonaut, thereby enabling collaborative teams 
consisting of human astronauts and humanoid robots. 


2. COGNITIVE HUMANOID ROBOTS 

Achieving effective collaboration between humans 
and robots will require the use of cognitive models 
on-board the robots. Embodied cognition, as we call 
it, uses cognitive models of human performance to 
augment a robot’s reasoning capabilities, and 
facilitates human-robot interaction in two ways. 
First, we hypothesize that the more a robot behaves 
like a human being, the easier it will be for humans 
to predict and understand its behavior and interact 
with it. Second, if humans and robots share at least 
some of the representational structures in their 
interactive communications or activities, we further 
hypothesize that communication among them will be 
much easier. For example, in tasks requiring 
direction generation, humans naturally use 
qualitative spatial relationships (Miller and Johnson-Laird, 
1976; Tversky, 1993) such as “left,” “up,” 
“east,” or “north.” Interaction with a robot capable 
of manipulating the same representation instead of 
traditional real number matrices would be more 
natural and efficient. In (Bugajska, et al., 2002) and 
(Trafton, et al., 2003) we used cognitive models of 
human performance of the task to augment the 
capabilities of robotic systems. 
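
The contrast between numeric and qualitative spatial representations can be illustrated with a minimal sketch. It is not taken from the systems described here: the function name, the axis convention (x forward, y to the robot's left), and the 45-degree sector boundaries are all assumptions chosen for illustration.

```python
import math


def qualitative_direction(dx: float, dy: float) -> str:
    """Map a numeric offset in the robot's own frame to a coarse spatial term.

    Assumed convention (not from the paper): x points forward and y points
    to the robot's left; each term covers a 90-degree sector.
    """
    angle = math.degrees(math.atan2(dy, dx))  # in the range [-180, 180]
    if -45.0 <= angle <= 45.0:
        return "front"
    if 45.0 < angle < 135.0:
        return "left"
    if -135.0 < angle < -45.0:
        return "right"
    return "behind"


# An object 2 m ahead and slightly to one side is simply "in front".
print(qualitative_direction(2.0, 0.3))
```

A human partner can then be told "the object is in front of me" rather than being given the offset vector itself.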

We are investigating the use of two cognitive 
architectures based on human cognition for certain 
high-level control mechanisms aboard Robonaut. 
These cognitive architectures are ACT-R/S (Harrison 
and Schunn, 2003) based upon the ACT-R 
architecture (Anderson and Lebiere, 1998) and 
Polyscheme (Cassimatis, 2002). 

ACT-R is one of the most prominent cognitive 
architectures to have emerged in the past two 
decades as a result of the information processing 
revolution in the cognitive sciences. Also called a 
unified theory of cognition, ACT-R is a relatively 
complete theory about the structure of human 
cognition that strives to account for the full range of 
cognitive behavior with a single, coherent set of 
mechanisms. Its chief computational claims are: 
first, that cognition functions at two levels, one 
symbolic and the other sub-symbolic; second, that 
symbolic memory has two components, one 
procedural and the other declarative; and third, that 
the sub-symbolic performance of memory is 
optimized in response to the statistical structure of 
the environment. These theoretical claims are 
implemented as a production-system modeling 
environment. The theory has been successfully used 
to account for human performance data in a wide 
variety of domains including memory for goals 
(Altmann and Trafton, 2002), human-computer 
interaction (Anderson, et al., 1997), and scientific 
discovery (Schunn and Anderson, 1998). 

Fig. 1. Robonaut, NASA’s Humanoid Robot 

The ACT-R/S system uses three different spatial 
representations suggested in (Harrison and Schunn, 
2003). Briefly, they suggest that people use a focal 
representation for object identification that consists 
of non-metric geons, a manipulative representation 
for grasping and tracking that consists of metric 
geons, and a configural representation for navigation 
that consists of rectangular bounding regions. While 
a coarse representation is adequate for obstacle 
avoidance in navigation, in order to do perspective 
taking, we must spatially transform an object, 
focusing on the configural representation to 
determine that object’s spatial references. These 
transformations must then be mapped onto the user’s 
representations so that actions can be performed. We 
use ACT-R/S to create cognitively plausible models 
of human performance of tasks to be performed by 
the robots. 

Furthermore, we are using Cassimatis’ Polyscheme 
architecture (Cassimatis, 2002) for spatial, temporal 
and physical reasoning. The Polyscheme cognitive 
architecture enables multiple representations and 
algorithms (including ACT-R models), encapsulated 
in “specialists” to be integrated into inferences that 
agents make about a situation or about possible 
situations. Finally, we use an updated version of the 
Polyscheme implementation of a physical reasoner to 
help keep track of the robot’s physical environment. 

2.1 Perspective-Taking 

One of the features of human cognition that 
facilitates natural human-robot interaction is 
“perspective-taking”. In order to understand 
utterances such as “the wrench on my left,” the robot 
must be able to reason from the perspective of the 
speaker to resolve the meaning of “my left”. We will 
use the Polyscheme architecture and ACT-R models 
to endow the robot with the ability to conceive of 
task-oriented goals and knowledge of another person. 
This will allow the robot to more easily predict and 
explain its own behavior, as well as the behavior of 
others, making it a better partner in a collaborative 
activity. 
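
The geometric core of resolving a phrase such as “the wrench on my left” can be sketched as a change of reference frame. The sketch below is purely illustrative and does not reproduce the Polyscheme or ACT-R/S models; the `Pose` class, the function names, and the convention that positive y in an agent's frame means "left" are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians, counter-clockwise from the world x-axis


def to_frame(pose: Pose, px: float, py: float) -> tuple:
    """Express a world point in the frame of an agent standing at `pose`."""
    dx, dy = px - pose.x, py - pose.y
    c, s = math.cos(pose.heading), math.sin(pose.heading)
    # Rotate the world offset by -heading (x forward, y to the agent's left).
    return (c * dx + s * dy, -s * dx + c * dy)


def on_left_of(agent: Pose, objects: dict) -> list:
    """Names of the objects that lie on the agent's left-hand side."""
    return [name for name, (px, py) in objects.items()
            if to_frame(agent, px, py)[1] > 0.0]


# The robot faces +x; the speaker stands 2 m away, facing the robot.
robot = Pose(0.0, 0.0, 0.0)
speaker = Pose(2.0, 0.0, math.pi)
wrenches = {"wrench_a": (1.0, 1.0), "wrench_b": (1.0, -1.0)}

print(on_left_of(robot, wrenches))    # the robot's own left
print(on_left_of(speaker, wrenches))  # "my left" as uttered by the speaker
```

Because the two agents face each other, the same utterance picks out different wrenches depending on whose perspective is adopted, which is why the robot must reason from the speaker's pose rather than its own.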

Polyscheme has a simulation mechanism, called a 
“world” in which the robot is endowed with 
perspective-taking capabilities. Polyscheme allows 
the robot to reason about what it sees in its 
immediate environment from different perspectives. 
Using worlds, Polyscheme can simulate the 
perspective it would have at other times, different 
places and in other hypothetical worlds, and use its 
specialists to make inferences within those 
perspectives. Polyscheme uses reasoning algorithms 
such as counterfactual reasoning, backtracking 
search, truth-maintenance and stochastic simulation. 
We have created a specialist to reason about the 
perspective(s) of other people. This allows 
Polyscheme to predict and explain other people’s 
behavior, using its perceptual, motor, procedural, 
memory, spatial and physical specialists from the 
perspective of another person. 

For both ACT-R/S and Polyscheme we have created 
preliminary models that can perform simple spatial 
perspective-taking tasks. There seem, however, to 
be advantages and disadvantages to both systems. 
For example, ACT-R/S has more difficulty doing 
large scale simulations, but has a large amount of 
historical cognitive plausibility (e.g., there have been 
a large number of empirical and psychological 
studies validating ACT-R/S), while Polyscheme has 
comparatively less of a cognitive history. 
Additionally, because the representations and 
operations of each system are a bit different, their 
behaviors are different and various tasks may be 
easier or more straightforward to model for one 
system than for another. 


3. NATURAL LANGUAGE UNDERSTANDING 

While ACT-R/S and Polyscheme enable us to 
embody a common cognitive model for tasks and 
spatial reasoning, we employ a natural language and 
gesture understanding system by which a human and 
robot can communicate this information to each 
other. Our rationale here is that the use of as natural 
an interface as possible again facilitates interaction. 
Our natural language interface combines a 
commercial speech recognition front-end with an 
in-house developed deep parsing system, NAUTILUS 
(Wauchope, 1994). ViaVoice™ is used to translate 
the speech signal into text, which is then passed to 
our natural language understanding system, NAUTILUS, 
to produce both syntactic and semantic 
interpretations. The semantic interpretation, 
interpreted gestures from the vision system, and 
command inputs from the computer or other 
interfaces are compared, matched and resolved in the 
command interpretation system. 

Using our multimodal interface (Figure 2) the human 
user can interact with the robot using both natural 
language and gestures. The semantic interpretation 
is linked, where necessary, to gesture information via 
the Gesture Interpreter, Goal Tracker/Spatial 
Relations component, and Appropriateness/Need 
Filter, and an appropriate robot action or response 
results. 



For example, the human user can ask the robot “How 
many objects do you see?” ViaVoice™ analyzes the 
speech signal, producing a text string. NAUTILUS 
parses the string and produces a representation 
something like the following, simplified here for 
expository purposes. 

(ASKWH                                          (1)
  (MANY N3 (:CLASS OBJECT) PLURAL) 
  (PRESENT #:V7791 
    (:CLASS P-SEE) 
    (:AGENT (PRON N1 (:CLASS SYSTEM) YOU)) 
    (:THEME N3))) 

The parsed text string is mapped into a kind of 
semantic representation (1). The various verbs or 
predicates of an utterance (e.g., see) are mapped into 
corresponding semantic classes (p-see) that have 
particular argument structures (agent, theme); for 
example, “you” is the agent of the p-see class of verbs 
in this domain and “objects” is the theme of this 
verbal class, represented as “N3,” a kind of 
co-indexed trace element in the theme slot of the 
predicate, since this element is syntactically fronted 
in English wh-questions. If the spoken utterance 
requires a gesture for disambiguation, as for 
example in the sentence “Look over there,” the gesture 
components obtain and send the appropriate gesture 
to the Goal Tracker component which combines 
linguistic and gesture information. 
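
To make the role of the co-indexed element concrete, representation (1) can be mirrored in a small data structure. This is a hypothetical sketch and not the actual NAUTILUS output format or command interpreter: the `SemanticFrame` class, its field names, and the `answer` function are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SemanticFrame:
    """Hypothetical stand-in for the simplified representation in (1)."""
    speech_act: str       # e.g. "ASKWH"
    quantifier: str       # e.g. "MANY"
    predicate_class: str  # e.g. "P-SEE"
    roles: dict           # argument role -> index, e.g. {"THEME": "N3"}
    referents: dict       # index -> semantic class, e.g. {"N3": "OBJECT"}

    def filler(self, role: str) -> str:
        """Follow the co-index from a role slot to the class it refers to."""
        return self.referents[self.roles[role]]


def answer(frame: SemanticFrame, visible: list):
    """Answer a 'how many ... do you see' question from a list of classes."""
    is_count_question = (frame.speech_act, frame.quantifier,
                         frame.predicate_class) == ("ASKWH", "MANY", "P-SEE")
    if not (is_count_question and frame.filler("AGENT") == "SYSTEM"):
        return None  # this sketch handles only the one question type
    theme = frame.filler("THEME")
    n = sum(1 for kind in visible if kind == theme)
    noun = theme.lower() + ("" if n == 1 else "s")
    return f"I see {n} {noun}."


frame = SemanticFrame(
    speech_act="ASKWH", quantifier="MANY", predicate_class="P-SEE",
    roles={"AGENT": "N1", "THEME": "N3"},
    referents={"N1": "SYSTEM", "N3": "OBJECT"},
)
print(answer(frame, ["OBJECT"] * 4))
```

The theme slot holds only the index N3; the class OBJECT is recovered by following that index, which parallels the way the fronted wh-phrase is linked back to the theme position of the predicate.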


4. SPATIAL LANGUAGE 

As human operators we often think in terms of the 
relative spatial positions of objects, and we use such 
relational linguistic terminology naturally in 
communicating with our human colleagues. For 
example, a speaker might say, “Hand me the wrench 
on the table.” If the assistant cannot find the wrench, 
the speaker might say, “The wrench is to the left of 
the toolbox.” The assistant need not be given precise 
coordinates for the wrench but can look in the area 
specified using the spatial relational terms. 

In a similar manner, this type of spatial language can 
be helpful for intuitive communication with a robot 
in many situations. Relative spatial terminology can 
be used to limit a search space by focusing attention 
in a specified region, as in “Look to the left of the 
toolbox and find the wrench.” It can be used to issue 
robot commands, such as “Pick up the wrench on the 
table.” A sequential combination of such directives 
can be used to describe and issue a high level task, 
such as, “Find the toolbox on the table behind you. 
The wrench is on the table to the left of the toolbox. 
Pick it up and bring it back to me.” Finally, spatial 
language can also be used by the robot to describe its 
environment, thereby providing a natural linguistic 
description of the environment, such as, “There is a 
wrench on the table to the left of the toolbox.” 

In all of these cases the spatial language increases the 
dynamic autonomy of the system by giving the 
human operator a less restrictive vernacular for 
communicating with the robot and vice versa. 
However, the examples above also assume some 
level of object recognition by the robot. 

To address the object recognition problem, we use 
natural language to address spatial relations, thereby 
assisting humans interacting with robots in 
recognition and labeling of objects (Skubic, et al., 
2002). Given our natural language interface, a human 
can easily communicate with the robot about objects 
and spatial relations in the environment through the 
use of a dialog that is easy and natural for the human. 
Furthermore, once an object is labeled, the user can 
then issue additional commands using natural spatial 
terms and by referencing the named object. An 
example is given in (2): 

Human: “How many objects do you see?”          (2) 
Robot: “I see 4 objects.” 
Human: “Where are they located?” 
Robot: “There are two objects in front of me, 
        one object on my right, and one object 
        behind me.” 
Human: “The nearest object in front of you is a 
        toolbox. Place the wrench to the left 
        of the toolbox.” 
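
The bookkeeping behind a dialog like (2) can be sketched as follows: unlabeled objects are grouped by coarse region around the robot, and a label supplied by the human is attached to the nearest object in the named region. The sketch is an illustration only. The function names, the robot-frame convention (x forward, y to the left), and the 90-degree sectors are assumptions, not the interface's actual implementation.

```python
import math
from collections import Counter


def region(x: float, y: float) -> str:
    """Coarse region of a point in the robot frame (x forward, y left)."""
    angle = math.degrees(math.atan2(y, x))
    if abs(angle) <= 45.0:
        return "front"
    if abs(angle) >= 135.0:
        return "behind"
    return "left" if angle > 0 else "right"


def count_by_region(objects: list) -> Counter:
    """How many detected objects fall in each region around the robot."""
    return Counter(region(x, y) for x, y in objects)


def label_nearest(objects: list, labels: dict, where: str, name: str):
    """Attach `name` to the nearest object in region `where`.

    Returns the index of the labeled object, or None if the region is empty.
    """
    candidates = [i for i, (x, y) in enumerate(objects)
                  if region(x, y) == where]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda i: math.hypot(*objects[i]))
    labels[nearest] = name
    return nearest


# Four detected objects, positions in meters relative to the robot.
detected = [(1.0, 0.2), (3.0, -0.5), (0.0, -2.0), (-1.5, 0.0)]
labels = {}
print(count_by_region(detected))
print(label_nearest(detected, labels, "front", "toolbox"), labels)
```

Once the label is stored, a later command such as “to the left of the toolbox” can refer to the object by name instead of by position.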

Establishing a common frame of reference is necessary so that it 
is clear what is meant by spatial references generated 
by both the human operator and the robot. 
Thus, if the human commands the robot, “Turn left,” 
the robot must know whether the operator is referring 
to the robot’s left or the operator’s left. Likewise, in 
a human-robot dialog, if the robot places a second 
object “just to the left of the first object,” both need 
to know if the goal location is to the left of the robot 
or the human. 

Currently, commands using spatial references (e.g., 
“Go to the right of the table”) assume an extrinsic 
reference frame of the object (table) and are based on 
the robot’s viewing perspective to be consistent with 
Grabowski’s “outside perspective” (Grabowski, 
1999). That is, the spatial reference assumes the 
robot is facing the referent object. 
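
Under this convention, a command such as “Go to the right of the table” can be turned into a goal point by taking the robot's line of sight to the referent as the forward axis. The following is a minimal sketch under stated assumptions: the function name, the 2-D point representation, the fixed standoff distance, and the treatment of “front” as the side of the referent facing the robot are all illustrative choices rather than details of the implemented system.

```python
import math


def goal_beside(robot: tuple, referent: tuple,
                relation: str, distance: float) -> tuple:
    """Goal point for 'go to the <relation> of <referent>'.

    The robot is assumed to be facing the referent, so the viewing
    direction is the unit vector from the robot to the referent.
    """
    rx, ry = robot
    tx, ty = referent
    dx, dy = tx - rx, ty - ry
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("robot and referent coincide; no viewing direction")
    fx, fy = dx / norm, dy / norm  # robot's viewing direction
    offsets = {
        "right": (fy, -fx),    # viewing direction rotated 90 degrees clockwise
        "left": (-fy, fx),     # rotated 90 degrees counter-clockwise
        "front": (-fx, -fy),   # assumed: the side facing the robot
        "behind": (fx, fy),    # the far side, as seen from the robot
    }
    ox, oy = offsets[relation]
    return (tx + distance * ox, ty + distance * oy)


# Robot at the origin looking at a table 2 m ahead along +x.
print(goal_beside((0.0, 0.0), (2.0, 0.0), "right", 1.0))
```

Note that the goal depends on where the robot stands: if the robot moves to the other side of the table, “right of the table” denotes the opposite side.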

Although there has been considerable research on the 
linguistics of spatial language for humans, there has 
been only limited work done in using spatial 
language for interacting with robots. Some 
researchers have proposed a framework for such an 
interface (Muller, et al., 2000). Moratz (2001) 
investigated the spatial references used by human 
users to control a mobile robot. An interesting 
finding is that the test subjects consistently used the 
robot’s perspective when issuing directives, in spite 
of the 180-degree rotation. At first, this may seem 
inconsistent with human-to-human communication. 
However, in human-to-human experiments, Tversky 
(1999) observed a similar result and found that 
speakers took the listener’s perspective in tasks 
where the listener had a significantly higher 
cognitive load than the speaker. 

The experiments by Moratz (2001) provide a rationale 
for using the robot’s viewing perspective. We are 
currently investigating this further through use of 
human-factors experiments where individuals who 
do not know the spatial reasoning capabilities and 
limitations of the robot provide instructions to the 
robot for performing various tasks where spatial 
referencing is required (Perzanowski, et al., 2003). 
The results of this study will be used to enhance the 
multimodal interface by establishing a common 
language for spatial referencing which incorporates 
those constructs and utterances most frequently used 
by untrained operators for commanding the robot. 



5. CONCLUSIONS 


Humanoid robots are being designed and built to 
provide assistance to humans in complex and 
challenging work environments, such as outer space. 
Achieving effective use of these humanoid robots in 
space will depend upon the difficulty of the tasks 
required of human astronauts interacting with robots. 
The use of cognitive tools aboard the robots provides 
a number of benefits, such as shared representations, 
behaviors, and modes of interaction between humans 
and robots, thereby easing the cognitive load on the 
part of the human. The key to achieving effective 
interaction is to provide the robot with sufficient 
skills for natural communication with humans so that 
humans can interact with the robot almost as easily 
as with another human. 



Fig. 3. MIT’s Leonardo Robot (photo courtesy 
Cynthia Breazeal, © MIT Media Lab, 2002) 


This paper describes the design, implementation, and 
capabilities of a robotic system architecture for a 
robot which can be used (at some level) to 
collaborate with a human. The capabilities required 
of the robot include voice recognition, natural 
language understanding, gesture recognition, spatial 
reasoning, and cognitive modeling with 
perspective-taking. These represent a small subset of potential 
capabilities humans utilize with one another in 
collaborating to perform a task in a complex 
environment, and barely scratch the surface of 
capabilities we might want to build into an 
intelligent, collaborative robot. 

The capabilities described above have been 
successfully implemented and demonstrated on 
several mobile robotic platforms (Sofge, et al., 
2004), and we are now porting them to Robonaut. 
We are also extending the capabilities of the 
cognitive architectures (both ACT-R/S and 
Polyscheme) and their perspective-taking cognitive 
models. Future work will focus on enhancing the 
cognitive models through expanded rulesets and 
cognitively plausible behaviors and reasoning 
mechanisms, and adding learning capabilities to the 
models so that the robots may be able to acquire new 
knowledge and skills through interaction with 
humans and while performing tasks. Parts of this 
architecture have already been extended to several 
robots designed specifically for enhanced human 
interaction, such as MIT’s robot Leonardo (Breazeal, 
2003) (Figure 3). While Leonardo is not a humanoid, 
it is being developed with human-like characteristics 
and functionalities. We are also extending the 
architecture and methodology to include and study 
collaboration between teams of robots and humans. 